Abstract. The main stated contribution of the Deformable Parts Model (DPM)
detector of Felzenszwalb et al. (over the Histogram-of-Oriented-Gradients
approach of Dalal and Triggs) is the use of deformable parts. A secondary
contribution is the latent discriminative learning. Tertiary is the use of
multiple components. A common belief in the vision community (including ours,
before this study) is that their ordering of contributions reflects the
performance of the detector in practice. However, what we have experimentally
found is that the ordering of importance might actually be the reverse. First,
we show that by increasing the number of components, and switching the
initialization step from their aspect-ratio, left-right flipping heuristics to
appearance-based clustering, considerable improvement in performance is
obtained. But more intriguingly, we show that with these new components, the
part deformations can now be completely switched off while still obtaining
results that are almost on par with the original DPM detector. Finally, we also
show initial results for using multiple components on a different problem,
scene classification, suggesting that this idea might have wider applications
in addition to object detection.



1 Introduction 

Consider the images of category horse in Figure 1 (row 1) from the challenging
PASCAL VOC dataset [9]. Notice the huge variation in the appearance, shape, 
pose and camera viewpoint of the different horse instances - there are left and 
right-facing horses, horses jumping over a fence in different directions, horses 
carrying people in different orientations, close-up shots, etc. How can we build a 
high-performing sliding-window detector that can accommodate the rich diver- 
sity amongst the horse instances? 

Deformable Parts Models (DPM) have recently emerged as a useful and popular
tool for tackling this challenge. The recent success of the DPM detector of
Felzenszwalb et al. [2] has drawn attention from the entire vision community
towards this tool, and subsequently it has become an integral component of
many classification, segmentation, person layout and action recognition tasks
(thus receiving the lifetime achievement award at the PASCAL VOC challenge).

Why does the DPM detector [2] perform so well? As the name implies, the
main stated contribution of [2] over the HOG detector described in [1] is the
idea of deformable parts. Their secondary contribution is latent discriminative
[Figure 1 panels: Aspect-ratio split [2,3]; Poselet split [4]; Viewpoint split [5,6]; Taxonomy split [7,8]; Visual Subcategories (this paper)]

Fig. 1. The standard monolithic classifier is trained on all instances together.
Viewpoint split partitions the training data using ground-truth viewpoint
annotations into left, right, and frontal subcategories. Poselet split clusters
the instances based on ground-truth keypoint annotations in the configuration
space. Taxonomy split groups instances into subordinate categories using a
human-defined semantic taxonomy. Aspect-ratio split uses a very simple bounding
box aspect-ratio heuristic. Visual subcategories are obtained using
(unsupervised) appearance-based clustering (top: a few examples, bottom: mean
image).




How important are "Deformable Parts" in the Deformable Parts Model? 3 

learning. Tertiary is the idea of multiple components (subcategories). The idea
behind deformable parts is to represent an object model using a lower-resolution
'root' template, and a set of spatially flexible high-resolution 'part' templates.
Each part captures local appearance properties of an object, and the
deformations are characterized by links connecting them. Latent discriminative
learning involves an iterative procedure that alternates the parameter
estimation step between the known variables (e.g., bounding box location of
instances) and the unknown, i.e., latent, variables (e.g., object part locations,
instance-component membership). Finally, the idea of subcategories is to
segregate object instances into disjoint groups, each with a simple (possibly
semantically interpretable) theme, e.g., frontal vs. profile view, or sitting vs.
standing person, and then to learn a separate model per group.

A common belief in the vision community is that deformable parts are the most
critical contribution, then latent discriminative learning, and then
subcategories. Although the ordering somewhat reflects the technical novelty
(interestingness) of the corresponding tools and the algorithms involved, is
that really the order of importance affecting the performance of the algorithm
in practice?

What we have experimentally found from our analysis of the DPM detector 
is that the ordering might actually be the reverse! First, we show that (i) by 
increasing the number of subcategories in the mixture model, and (ii) switching 
from their aspect-ratio, left-right flipping heuristics to appearance-based clus- 
tering, considerable improvement in performance is obtained. But more intrigu- 
ingly, we show that with these new subcategories, the part deformations can be 
completely turned off, with only minimal performance loss. These observations 
together highlight that the conceptually simple subcategories idea is indeed an 
equally important contribution in the DPM detector that can potentially alle- 
viate the need for deformable parts for many practical applications and object 
classes. 

2 Understanding Subcategories 

In order to deal with significant appearance variations that cannot be tackled
by the deformable parts, [2] introduced the notion of multiple components, i.e.,
subcategories, into their detector. The first version of their detector [10] only
had a single subcategory. The next version [2] had two subcategories that were
obtained by splitting the object instances based on an aspect-ratio heuristic.
In the latest version [11], this number was increased to three, with each
subcategory comprising two bilaterally asymmetric, i.e., left-right flipped,
models (effectively resulting in 6 subcategories). The introduction of each
additional subcategory has resulted in significant performance gains (e.g., see
slide 23 in [12]).

Given this observation, what happens if we further increase the number of
subcategories in their model? In Section 4 we will see that this does not
translate to improvement in performance. This is because the aspect-ratio
heuristic does not generalize well to a large number of subcategories, and thus
fails to provide a good initialization. Nonetheless, it is possible to explore
other ways to generate subcategories. For example, subcategories for cars can
be based either on object pose (e.g., left-facing, right-facing, frontal), or
car manufacturer (e.g., Subaru, Ford, Toyota), or some functional attribute
(e.g., sports car, utility vehicle, limousine). Figure 1 illustrates a few
popular subcategorization schemes for horses.

What is it that the different partitioning schemes are trying to achieve? A 
closer look at the figures reveals that they are trying to encode the homogeneity 
in appearance. It is the visual homogeneity of instances within each subcategory 
that simplifies the learning problem, leading to better-performing classifiers
(Figure 2). What this suggests is that, instead of using semantics or empirical heuristics,
one could directly use appearance-based clustering for generating the subcate- 
gories. We use this insight to define new subcategories in the DPM detector, and 
refer to them as visual subcategories (in contrast to semantic subcategories that 
involve either human annotations or object-specific heuristics). 




[Figure 2 panels: Visual Subcategories (left); Semantic Subcategories (right)]

Fig. 2. A single linear model cannot separate the data well into two classes. (Left) 
When similar instances (nearby in the feature space) are clustered into subcategories, 
good models can be learned per subcategory, which when combined together separate 
the two classes well. (Right) In contrast, a semantic clustering scheme also partitions 
the data but leads to subcategories that are not optimal for learning the category-level
classifier. 

Related Work The idea of subcategories is inspired by works in the machine
learning literature [13,14,15,16,17,18] that consider solving a complex
(nonlinear) classification problem by using locally linear classification
techniques. Several computer vision approaches have explored different
strategies for generating subcategories. In [19,5,?], viewpoint annotations
associated with instances were used to segregate them into separate left,
right, and frontal sub-classes. In [3], the size (height) of detection windows
was used to cluster them into near and far-scale sub-classes. In [21], co-watch
features are used to group videos of a specific category into simpler
subcategories. In [4], instances are clustered into poselets using keypoint
annotations in the configuration space. In [7], subordinate categories of a
basic-level category are constructed using human annotations.

The concept of subcategories has also received significant attention in
cognitive psychology [22,23]. In the seminal work of [23], the idea of
prototypes was introduced. The prototype concept relies on the notion of
typicality: the resemblance of an instance to the other members of its category
and its differences from the members of other categories.

Closely related are also the recently popular exemplar-based methods [5,24].
While in a global strategy, a single classifier is trained using all instances be- 
longing to a class as positives, in the case of exemplar-based methods, a separate 
classifier is learned for each individual instance. Although promising results have 
been demonstrated, exemplar methods are prone to overfitting since too much 
emphasis is often placed on local irregularities in the data [25]. The global and 
local learning strategies sit at two extremes of a large spectrum of possible com- 
promises that exploit information from labeled examples. This paper explores 
intermediate points of this spectrum. 

3 Learning Subcategories 

We first briefly review the key details of using subcategories in the DPM detector, 
and then explain the details specific to their use in our analysis. 

Given a set of n labeled instances (e.g., object bounding boxes)
$D = (\langle x_1, y_1 \rangle, \ldots, \langle x_n, y_n \rangle)$, with
$y_i \in \{-1, 1\}$, the goal is to learn a set of K subcategory classifiers
to separate the positive instances from the negative instances, wherein each
individual classifier is trained on a different subset of the training data.
The assignment of instances to subcategories is modeled as a latent variable
z. This binary classification task is formulated as the following (latent SVM)
optimization problem that minimizes the trade-off between the $\ell_2$
regularization term and the hinge loss on the training data [2]:

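$$\arg\min_{w} \; \frac{1}{2} \sum_{k=1}^{K} \|w_k\|^2 + C \sum_{i=1}^{n} \epsilon_i, \qquad (1)$$

$$y_i \cdot s_i^{z_i} \ge 1 - \epsilon_i, \quad \epsilon_i \ge 0, \qquad (2)$$

$$z_i = \arg\max_{k} \; s_i^k, \qquad (3)$$

$$s_i^k = w_k \cdot \phi_k(x_i) + b_k. \qquad (4)$$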

The parameter C controls the relative weight of the hinge-loss term, $w_k$ denotes
the separating hyperplane for the kth subclass, and $\phi_k(\cdot)$ indicates the
corresponding feature representation. Since the minimization is semi-convex, the
model parameters $w_k$ and the latent variable z are learned using an iterative
approach [2].
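
The scoring and assignment steps of the formulation above (a linear score per subcategory, with each instance assigned to its best-scoring subcategory) can be sketched in a few lines. This is only an illustrative sketch, assuming plain feature vectors shared by all subcategories; the function names and toy values are hypothetical and are not taken from the DPM code.

```python
def subcategory_scores(x, weights, biases):
    """Score one feature vector under every subcategory: s_k = w_k . x + b_k."""
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def assign_latent(x, weights, biases):
    """Latent label z = index of the highest-scoring subcategory, plus its score."""
    scores = subcategory_scores(x, weights, biases)
    z = max(range(len(scores)), key=scores.__getitem__)
    return z, scores[z]


def hinge_slack(y, score):
    """Slack needed to satisfy the margin constraint y * s >= 1 - slack."""
    return max(0.0, 1.0 - y * score)


# Two hypothetical subcategory templates over 2-D features (toy values).
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, -0.5]
z, s = assign_latent([0.2, 2.0], weights, biases)
```

In an iterative scheme of this kind, the assignment step would alternate with re-estimating the weights of each subcategory on the instances currently assigned to it.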

Initialization As mentioned earlier, a key step for the success of the latent
subcategory approach is to generate a good initialization of the subcategories.
Our initialization method is to warp all the positive instances to a common
feature space $\phi(\cdot)$, and to perform unsupervised clustering in that space.
In our experiments, we found the K-means clustering algorithm using the
Euclidean distance function to provide a good initialization.
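
As a rough illustration of this initialization, the sketch below runs plain Lloyd's K-means with Euclidean distance on toy feature vectors. It assumes the positive instances have already been warped to fixed-length vectors; the evenly spaced choice of starting centers is an assumption made here for reproducibility and is not described in the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    n = len(points)
    # Assumption: start from evenly spaced instances (deterministic).
    centers = [list(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each instance joins its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy stand-ins for warped positive instances: two well-separated groups.
features = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
            [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
init_labels, init_centers = kmeans(features, k=2)
```

The resulting labels would serve only as the starting subcategory assignment; the latent training step is free to move instances between subcategories afterwards.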



Calibration One difficulty in merging subcategory classifiers during the testing
phase is to ensure that the scores output by individual SVM classifiers (learned
with different data distributions) are calibrated appropriately, so as to suppress
the influence of noisy ones. We address this problem by transforming the output
of each SVM classifier by a sigmoid to yield comparable score distributions
[26,27] (Figure 3). Given a thresholded output score $s_i^k$ for instance i in
subcategory k, its calibrated score is defined as



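$$g_i^k = \frac{1}{1 + \exp(A_k \cdot s_i^k + B_k)}, \qquad (5)$$

where $A_k, B_k$ are the learned parameters of the following logistic loss function:

$$\arg\min_{A_k, B_k} \; -\sum_{i} \left[ t_i \log g_i^k + (1 - t_i) \log(1 - g_i^k) \right], \qquad (6)$$

$$t_i = Or(W_i^k, W_i). \qquad (7)$$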



$Or(w_1, w_2) = \frac{|w_1 \cap w_2|}{|w_1 \cup w_2|} \in [0, 1]$ indicates the overlap score between two bounding
boxes [28], $W_i$ is the ground-truth bounding box for the ith training sample,
and $W_i^k$ indicates the bounding box predicted by the kth subcategory. In our
experiments, we found this calibration step to help improve the performance
(mean A.P. increase of 0.5% in the detection experiments).
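
The calibration step can be illustrated with a small sketch: an overlap (intersection-over-union) function provides soft targets, and a two-parameter sigmoid is fit to them by minimizing the cross-entropy. The use of plain gradient descent, the learning rate, the starting values and the toy data are all assumptions made for illustration; the paper does not state which optimizer was used.

```python
import math


def overlap(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def calibrate(score, a, b):
    """Sigmoid-calibrated score g = 1 / (1 + exp(a * s + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))


def fit_sigmoid(scores, targets, lr=0.1, steps=2000):
    """Fit (a, b) by gradient descent on the cross-entropy between g and targets."""
    a, b = -1.0, 0.0  # assumed starting point
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, t in zip(scores, targets):
            g = calibrate(s, a, b)
            # For this parameterization, d(loss)/d(a*s + b) = t - g.
            grad_a += (t - g) * s
            grad_b += (t - g)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy data: raw detector scores and the overlap of each detection with its
# ground-truth box (used as the soft target).
raw_scores = [-2.0, -1.0, 1.0, 2.0]
soft_targets = [0.0, 0.1, 0.8, 0.9]
a_k, b_k = fit_sigmoid(raw_scores, soft_targets)
```

A subcategory whose high-scoring detections overlap poorly with the ground truth would receive a flat sigmoid under this fit, which is the suppression effect described in the text.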



[Figure 3 panels: (a) 'Noisy' Subcategory and (b) 'Good' Subcategory, each showing sample images, PR on validation data, and the fitted sigmoid]



Fig. 3. The classifier trained on a noisy subcategory (horses with extreme occlusion
and confusing texture) performs poorly on the validation dataset. As a result, its
influence is suppressed by the sigmoid. In contrast, the classifier for a good
subcategory (horses with homogeneous appearance) performs well on the validation
data, and hence its influence is boosted by the calibration step.
Table 1. Results on VOC2007. Row 1: result of [11]. Row 2: result using visual subcategories. Row 3:
result of [11] with parts turned off, i.e., using all the features at twice the spatial resolution of the
root filter with no deformations. Row 4: same result using visual subcategories.



4 Experimental Analysis 

We performed our analysis on the PASCAL VOC 2007 comp3 challenge dataset
and protocol [9]. Table 1 summarizes our key results. Row 1 shows the (baseline)
result of the DPM detector [11]. Row 2 shows the result obtained by using
visual subcategories (with K = 15) in the DPM detector. It surpasses the baseline
by 2.3% on average across the 20 VOC classes (the mean A.P. improves from
32.3% to 34.6%). Figure 4 shows the top detections obtained per subcategory for
the horse and train categories. The individual detectors do a good job at
localizing instances of their respective subcategories. In Figure 5, the
discovered subcategories for symmetric (pottedplant) and deformable (cat)
classes are displayed. The subcategories obtained for all of the 20 VOC classes
are displayed in the supplementary material.

Rows 3 and 4 of Table 1 show the results obtained by turning off the deformable
parts. More specifically, rather than sampling 'parts' from the high-resolution
HOG template (sampled at twice the spatial resolution relative to the features
captured by the root template) and modeling the deformation amongst them, we
directly use all the features from the high-resolution template. This update to
the DPM detector results in a simple multi-scale (two-level pyramid)
representation, with the finer resolution catering towards improved feature
localization. We observe that using this two-level pyramid representation for
the visual subcategories yields a mean A.P. of 30.9%, which is almost on par
with the full deformable parts baseline (32.3%). This result becomes intuitive
from the observation that instances within each subcategory are well-aligned
(see Figure 5 and supplementary material), and thus simpler models (without
deformations) suffice for training discriminative detectors. For instance, in
the case of rigid objects such as pottedplants, tvmonitors and trains, the use
of part deformations does not offer any improvement over using the multi-scale
visual subcategory detector. For a few classes though, such as person and sofa,
part deformations seem to be useful, while for some others, such as
dining-table and sheep, the multi-scale visual subcategory detector actually
performs better. These observations suggest that, in practice, the relatively
simple concept of visual subcategories is as important
1 Even though the bias term $b_k$ is used in the formulation of equations (1)-(4)
to make the scores of multiple classifiers comparable, we have found that it is
possible for some of the subcategories to be very noisy (specifically when K is
large), in which case their output scores cannot be compared directly with
other, more reliable ones.
[Figure 4 panels: for each subcategory, the average image, the HOG template, and the top 5 detections. Categories: Horse and Train]

Fig. 4. As the intra-class variance within subcategories is low, the learned detectors 
perform quite well at localizing instances of their respective subcategories. Notice that 
for the same aspect-ratio and viewpoint, there are two different subcategories (rows 
4,5) discovered for the train category. 






[Figure 5 panels: Sample Instances per discovered subcategory]



Fig. 5. The visual subcategories discovered for pottedplants correspond to different camera view- 
points, while cats are partitioned based on their pose. The baseline system based on the aspect- 
ratio, left-right flipping heuristic cannot capture such distinctions (as many of the subcategories 
share the same aspect-ratio and are symmetric). 
as the use of deformable parts in the DPM detector.
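
To make the parts-off variant concrete, the sketch below scores a window with a rigid two-level model: a coarse root response plus a response over all features at twice the resolution, with no search over part placements and hence no deformation cost. Flattened feature vectors, one value per cell, and the toy weights are assumptions made for brevity; this is not the actual detector code.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def two_level_score(root_feat, fine_feat, w_root, w_fine, bias):
    """Rigid two-level score: coarse root response plus fine-resolution response.

    Both templates are evaluated at a fixed placement, so no deformation
    term appears in the score.
    """
    return dot(w_root, root_feat) + dot(w_fine, fine_feat) + bias


# Hypothetical flattened features: a 1x2 root grid and the same window at
# twice the resolution (2x4 cells), one feature value per cell.
root_feat = [0.5, 1.0]
fine_feat = [0.1] * 8
score = two_level_score(root_feat, fine_feat, [1.0, 1.0], [0.5] * 8, -0.2)
```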

Computational Issues. In terms of computational complexity, the two-scale
visual subcategory detector (K = 15) involves one coarse (root) and one
fine-resolution template per subcategory, for a total of 30 HOG templates. In
contrast, the DPM detector has K = 6 subcategories, each with one root and
eight part templates, for a total of 54 HOG templates, which need to be
convolved at test time. In terms of model learning, the DPM detector has the
subcategory label as well as the part deformation parameters (four for each of
the 48 parts) as latent variables, while the visual subcategory detector only
has the subcategory label as latent. Therefore, it not only requires fewer
rounds of latent training than the DPM detector (leading to faster
convergence), but is also less susceptible to getting stuck in a bad local
minimum [29]. As emphasized in [2], simpler models are preferable, as they can
perform better in practice than rich models, which often suffer from
difficulties in training.
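
The template counts quoted above follow from simple arithmetic, reproduced here as a quick check. The helper name is hypothetical.

```python
def template_count(num_subcategories, templates_per_subcategory):
    """Number of HOG templates convolved at test time."""
    return num_subcategories * templates_per_subcategory


# Visual subcategory detector: one root and one fine template per subcategory.
visual_templates = template_count(15, 1 + 1)

# DPM detector: one root and eight part templates per subcategory.
dpm_templates = template_count(6, 1 + 8)

# Number of parts in the DPM detector (eight per subcategory).
dpm_parts = 6 * 8
```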

Number of subcategories. One important parameter is the number of
subcategories K. We analyze the influence of K by using different values
(K = [3,6,9,12,15,20,25,50,100]) for a few classes ('boat', 'dog', 'train', 'tv') on the
validation set. We plot the variation in the performance over different K in
Figure 6. The performance gradually increases with increasing K, but stabilizes
around K = 15. We used K = 15 in all the detection experiments.
Fig. 6. Variation in detection accuracy as a function of number of clusters for four
distinct VOC2007 classes. The A.P. gradually increases with increasing number of 
clusters and stabilizes beyond a point. 



Initialization. Proper initialization of clusters is a key requirement for the
success of latent variable models. We analyzed the importance of
appearance-based initialization by comparing it with the aspect-ratio based
initialization of [2]. Simply increasing the number of aspect-ratio based
clusters leads to a decrease in performance (mean A.P. drops by 1.2%), while
for the same number (K = 15), appearance-based clustering helps improve the
mean A.P. by 2.3%.

We noticed minimal variation in the final performance on multiple runs with
different K-means initializations. We found that the (latent) discriminative
reclustering step helps in cleaning up any mistakes of the initialization step.
(We also observed that most of the reclustering happens in the first latent
update.)
5 Application: Visual Subcategories for Scene Classification

Another scenario where the problem of high intra-class variability is witnessed is 
scene classification. Scene categories exhibit a large range of visual diversity due 
to significant variation in camera viewpoint and scene structure. For example, 
when we refer to the scene category 'coast' (from [30]), it could contain images 
of rocky shores, sunsets, cloudy beaches, or calm waters. From our analysis of 
visual subcategories on the object detection dataset, we could expect that their 
use could also aid in simplifying the learning task for scene classification. 

Dataset details. We use the Scene Understanding (SUN) database for our 
scene classification experiments. The SUN database is a collection of about 
100,000 images organized into an exhaustive set of 899 scene categories [30].
For our experiments, we use the subset of 397 well-sampled categories. These 
397 fine-grained scene categories are arranged in a 3-level tree: with 397 leaf 
nodes (subordinate categories) connected to 15 parent nodes at the second level 
(basic-level categories) that are in turn connected to 3 nodes at the third level 
(superordinate categories) with the root node at the top. This hierarchy was not 
considered in the original experimental evaluations in [30] but used as a human 
organizational tool (in order to facilitate the annotation process e.g., annotators 
navigate through the three-level hierarchy to arrive at a specific scene type (e.g. 
'bedroom') by making relatively easy choices (e.g. 'indoor' versus 'outdoor' at 
the higher level)). 

Our goal is to train a classifier that can identify images as belonging to one of
the 15 basic-level categories (listed in footnote 2). We use the images from all
the subordinate categories in a basic-level category to build the data
corresponding to that basic-level category. The data was split into half
training and half testing. The classifiers are all trained in a 'one-vs-all'
fashion where instances belonging to a specific category are considered
positive examples, and the rest of the instances (belonging to all the other
categories except the chosen one) serve as negative examples in the training
process. While training subcategory classifiers, instances belonging to a
particular subcategory (within a category) are treated as positives and the
rest of the instances belonging to that category are ignored (treated as don't
care examples). The number of subcategories K was set to 50 (to tackle the
larger intra-category diversity in this dataset). We evaluate performance using
the A.P. metric as used in the PASCAL VOC image classification task [9].
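
The labeling rule for one subcategory classifier can be written down directly. In the sketch below, the function name and the encoding of "don't care" as 0 are assumptions chosen for illustration; the category names are taken from the paper's list of basic-level categories.

```python
def subcategory_labels(categories, subcategories, target_cat, target_sub):
    """Return +1 / -1 / 0 (ignore) training labels for one subcategory classifier."""
    labels = []
    for cat, sub in zip(categories, subcategories):
        if cat != target_cat:
            labels.append(-1)   # instances of all other categories are negatives
        elif sub == target_sub:
            labels.append(+1)   # members of the chosen subcategory are positives
        else:
            labels.append(0)    # same category, other subcategory: don't care
    return labels


# Toy example: three 'parks' images in two subcategories, one 'industrial' image.
cats = ['parks', 'parks', 'parks', 'industrial']
subs = [0, 1, 0, 3]
labels = subcategory_labels(cats, subs, target_cat='parks', target_sub=0)
```

Instances labeled 0 would simply be left out of the training set for that subcategory classifier.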

2 The 15 basic-level categories are 'shoppingNdining', 'workplace', 'homeNhotel',
'vehicleInterior', 'sportsNleisure', 'cultural', 'waterNsnow', 'mountainsNdesert',
'forestNfield', 'transportation', 'historicalPlace', 'parks', 'industrial',
'housesNgardens', 'commercialMarkets'.

We use the GIST feature representation (using the implementation of [31]),
which has been well-studied in the literature for scene classification experiments
(e.g., [8]). We create this descriptor for each image at a 10 x 10 grid resolution
where each bin contains that image patch's average response to steerable filters
at 8 orientations and 4 scales. We acknowledge that a classification system based on
this representation is unlikely to beat the prevailing state-of-the-art. Multiple 
previous methods have shown that classification performance is significantly im- 
proved by the use of BOW-models with densely sampled feature points along 
with multiple sets of feature descriptors, and the use of spatial pyramids. We 
chose the simple GIST-based representation for our analysis in this paper as our 
focus is not to argue in favor of a new classification method, but to show the 
benefits of the visual subcategory concept using a generic and simple framework. 

5.1 Visual subcategories are semantically interpretable 

Our approach based on visual subcategories achieves a mean A.P. of 27.1%, clearly
outperforming the baseline linear SVM (16.9%) (see the table in the supplementary
material for results per category). The utility of our approach becomes more
evident as we take a closer look at the classification results of the discovered
subcategories (see Figure 7). Many of the subcategories discovered correspond to
the semantic subordinate categories. For example, the basic-level category
'vehicleInterior' contains clusters for 'cockpit', 'bus interior', 'car front
seat', and 'car back seat' that all correspond to the fine-grained categories
constituting this basic-level category. This in turn allows deeper reasoning
about the image rather than simply assigning the category label. For example,
instead of simply classifying an image as 'vehicleInterior', we could now say
that it is a 'cockpit' image.

5.2 Visual subcategories alleviate the need for human supervision 

Given the above result, we seek to quantitatively analyze the benefit of
gathering human-annotated subordinate categories over the visual subcategories
discovered without supervision. To this end, we ran an experiment where the
subcategories in our framework are initialized using the ground-truth
subordinate categories. The result obtained using this initialization (mean
A.P. of 27.2%) is very similar to that obtained using our unsupervised
subcategories (27.1%) (see the table in the supplementary material for results
per category). This is interesting because it indicates that human supervision
for creating the fine-grained subcategories to train a basic-level category
classifier may not be of great benefit compared to the unsupervised visual
subcategories. Our observations here are also supported by the recent findings
in [8], wherein semantic similarity was found to be correlated to visual
similarity at the bottom of the ImageNet [7] hierarchy, i.e., when the
basic-level category is sliced into extremely small subsets. However, to
acquire these fine-grained subcategories, one needs to expend a significant
amount of human annotation effort.

6 Conclusion 

Contrary to the existing belief that deformable parts are the key contribution
to the success of the deformable parts model detector, we have found that
Fig. 7. SUN397 Scene-Image Classification Results: Scene-Image categories
exhibit a large visual diversity due to significant variation in camera
viewpoint and scene structure. The 'vehicleInterior' classifier contains
separate subcategories for cockpit, bus interior, car front seat and car back
seat. 'commercialMarket' is composed of different types of buildings,
skyscrapers, and street/alley scenes. The 'industrial' category has water
towers, oil rigs, landfills, and outdoor industrial scenes. Finally, the
category 'park' has baseball fields, carousels, and outdoor tennis fields as
subcategories. It is interesting to note that, using a completely unsupervised
approach, it is possible to discover subcategories that mostly correspond to
the human-annotated fine-grained categories of the SUN397 dataset (even subtle
ones such as car front seat and car back seat).
the use of subcategories can potentially alleviate their need. The use of visual
subcategories not only benefits model learning and performance, but also leads
to simpler and more interpretable models. In addition to object detection, their
use can also benefit the scene classification task as it can alleviate the need for
human supervision in carving the space of fine-grained subordinate categories.